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Abstract 

Leveraging the coherent exploration of 
Hamiltonian flow, Hamiltonian Monte Carlo 
produces computationally efhcient Monte 
Carlo estimators, even with respect to com¬ 
plex and high-dimensional target distribu¬ 
tions. When confronted with data-intensive 
applications, however, the algorithm may 
be too expensive to implement, leaving us 
to consider the utility of approximations 
such as data subsampling. In this paper 
I demonstrate how data subsampling fun¬ 
damentally compromises the efficient explo¬ 
ration of Hamiltonian flow and hence the 
scalable performance of Hamiltonian Monte 
Carlo itself. 

With the preponderance of applications featuring 
enormous data sets, methods of inference requiring 
only subsamples of data are becoming more and more 
appealing. Subsampled Markov Chain Monte Carlo 
algorithms, (Neiswanger et ah, 2013; Welling & Teh, 
2011), are particularly desired for their potential ap¬ 
plicability to most statistical models. Unfortunately, 
careful analysis of these algorithms reveals unavoid¬ 
able biases unless the data are tall, or highly redundant 
(Bardenet et ah, 2014; Teh et ah, 2014; Vollmer et ah, 
2015). The utility of these subsampled algorithms is 
then a consequence of not only the desired accuracy 
and also the particular model and data under consid¬ 
eration, which can severely restrict practicality. 

Recently (Chen et ah, 2014) considered subsampling 
within Hamiltonian Monte Carlo (Duane et ah, 1987; 
Neal, 2011; Betancourt et ah, 2014b) and demon¬ 
strated that the biases induced by naive subsampling 
lead to unacceptably large biases. Ultimately the au¬ 
thors rectified this bias by sacrificing the coherent ex¬ 


ploration of Hamiltonian flow for a diffusive correction, 
fundamentally compromising the scalability of the al¬ 
gorithm with respect to the complexity of the target 
distribution. An algorithm scalable with respect to 
both the size of the data and the complexity of the 
target distribution would have to maintain the coher¬ 
ent exploration of Hamiltonian flow while subsampling 
and, unfortunately, these objectives are mutually ex¬ 
clusive in general. 

In this paper I review the elements of Hamiltonian 
Monte Carlo critical to its robust and scalable per¬ 
formance in practice and demonstrate how different 
subsampling strategies all compromise those proper¬ 
ties and consequently induce poor performance. 

1. Hamiltonian Monte Carlo in Theory 

Hamiltonian Monte Carlo utilizes deterministic, 
measure-preserving maps to generate efficient Markov 
transitions. Formally, we begin by complementing a 
target distribution, 

vj oc exp[—U(g)] d"( 7 , 

with a conditional distribution over auxiliary momenta 
parameters, 

Wq OC exp[-r(p, g)] d"p. 

Together these define a joint distribution, 

WH OC exp[- {T{q,p) + V{q))] d”gd> 
cxexp[-77(g,p)]d"gd>, 

and a Hamiltonian system corresponding to the Hamil¬ 
tonian, H{q,p). We refer to T{q,p) and V{q) as the 
kinetic energy and potential energy, respectively. 

The Hamiltonian immediately defines a Hamiltonian 
flow on the joint space. 


<('f : (9,P) ^ (9,p),Vt e K 

= 4's+u 
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which exactly preserves the joint distribution, 
{4>?)^'^h = ^H- 

Consequently, we can compose a Markov chain by sam¬ 
pling the auxiliary momenta, 

q {q,p), P ^ 

applying the Hamiltonian flow, 

{q,p) (j)f{q,p) 

and then projecting back down to the target space, 

{q,p) q- 

By construction, the trajectories generated by the 
Hamiltonian flow explore the level sets of the Hamil¬ 
tonian function, which shadow the probability mass of 
the joint distribution. Because these level sets can also 
span large volumes of the joint space, sufhciently-long 
trajectories can yield transitions far away from the ini¬ 
tial state of the Markov chain, drastically reducing au¬ 
tocorrelations and producing computationally efficient 
Monte Carlo estimators. 

When the kinetic energy does not depend on position 
we say that the Hamiltonian is separable, H{q,p) = 
T{p) + V{q), and the exact Hamiltonian flow can be 
generated by the Hamiltonian operator, H, 



where 

^ dH d dH d 

dp dq dq dp 

- 

dp dq dq dp 

= f + V . 

In this paper I consider only separable Hamilto¬ 
nians, although the conclusions also carry over to 
the non-seperable Hamiltonians, for example those 
arising in Riemannian Hamiltonian Monte Carlo 
(Girolami & Calderhead, 2011). 

2. Hamiltonian Monte Carlo in Practice 

The biggest challenge of implementing Hamiltonian 
Monte Carlo is that the Hamiltonian operator is rarely 
calculable in practice and we must instead resort to ap¬ 
proximate integration of the Hamiltonian flow. Sym- 
plectic integrators, which yield numerical trajectories 
that closely track the true trajectories, are of particu¬ 
lar importance to any high-performance implementa¬ 
tion. 


An especially transparent strategy for construct¬ 
ing symplectic integrators is to split the Hamilto¬ 
nian into terms with soluble flows which can then 
be composed together (Leimkuhler & Reich, 2004; 
Hairer et ah, 2006). For example, consider the sym¬ 
metric Strang splitting. 


4 O o o o 

where e is a small interval of time known as the step 
size. Appealing to the Baker-Campbell-Hausdorff for¬ 
mula, this symmetric composition yields 





= e 2 ^ o exp ( eT 


-V 


4 1 

2 r 


T,V 


O(e^) 


= exp(|H + ef+|H+i ^ 


1 


+ 7 


T,V 




= -f 0{e^) . 

Composing this symmetric composition with itself L = 
TIe times results in a symplectic integrator accurate to 
second-order in the step size for any finite integration 
time, T, 





= e'^^ ^TO{e^) 

= -f 0{e^) . 

Remarkably, the resulting numerical trajectories are 
confined to the level sets of a modified Hamiltonian 
given by an O(e^) perturbation of the exact Hamilto¬ 
nian (Hairer et ah, 2006; Betancourt et ah, 2014a). 

Although such symplectic integrators are highly accu¬ 
rate, they still introduce an error into the trajecto¬ 
ries that can bias the Markov chain and any resulting 
Monte Carlo estimators. In practice this error is typi¬ 
cally compensated with the application of a Metropo¬ 
lis acceptance procedure, accepting a point along the 
numerical trajectory only with probability 

a{p,q) = min {l,eyip{H{q,p) - H o (j)^.,.{q,p)'^'^ . 
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A critical reason for the scalable performance of such 
an implementation of Hamiltonian Monte Carlo is that 
the error in a symplectic integrator scales with the step 
size, e. Consequently a small bias or a large acceptance 
probability can be maintained by reducing the step 
size, regardless of the complexity or dimension of the 
target distribution (Betancourt et ah, 2014a). If the 
symplectic integrator is compromised, however, then 
this scalability and generality is lost. 


left biased. 

Unlike the numerical trajectories from the full Hamil¬ 
tonian, these subsampled trajectories are biased away 
from the exact trajectories regardless of the chosen 
step size. In particular, the bias of each step, 

o o -f O(e^) 

= -f 0{€^) , 


3. Hamiltonian Monte Carlo With 
Subsampling 

A common criticism of Hamiltonian Monte Carlo is 
that in data-intensive applications the application of 
potential energy operator, 


where 


AU, = - 


dV dVA d 
- J- 


dq dq J dp' 
persists over an entire trajectory, 

o o + 0{e^) . 


( 1 ) 


dq dp' 

and hence the simulation of numerical trajectories, can 
become infeasible given the expense of the gradient cal¬ 
culations. This expense has fueled a variety of modi¬ 
fications of the algorithm aimed at reducing the cost 
of the potential energy operator, often by any means 
necessary. 

An increasingly popular strategy targets Bayesian ap¬ 
plications where the data are independently and iden¬ 
tically distributed. In this case the posterior can be 
manipulated into a product of contributions from each 
subset of data, and the potential energy likewise de¬ 
composes into a sum, V{q) = J2j=i where each 

{Vj} depends on only a single subset. This decom¬ 
position suggests algorithms which consider not the 
entirety of the data and the full potential energy, V, 
but rather only a few subsets at a time. 

The performance of any such subsampling method de¬ 
pends critically on the details of the implementation 
and the structure of the data itself. Here I consider 
the performance of two immediate implementations, 
one based on subsampling the data in between Hamil¬ 
tonian trajectories and one based on subsampling the 
data within a single trajectory. Unfortunately, the per¬ 
formance of both methods leaves much to be desired. 

3.1. Subsampling Data In Between 
Trajectories 


As the dimension of the target distribution grows, 
the subsampled gradient, JdVj/dq, drifts away from 
the true gradient, dVjdq, unless the data become in¬ 
creasingly redundant. Consequently the resulting tra¬ 
jectory introduces an irreducible bias into the algo¬ 
rithm, similar in nature to the asymptotic bias seen in 
subsampled Langevin Monte Carlo (Teh et ah, 2014; 
Vollmer et ah, 2015), which then induces either a van¬ 
ishing Metropolis acceptance probability or highly- 
biased expectations if the Metropolis procedure is ig¬ 
nored outright (Figure 1). 

Unfortunately, the only way to decrease the depen¬ 
dency on redundant data is to increase the size of each 
subsample, which immediately undermines any com¬ 
putational benefits. 

Consider, for example, a simple application where we 
target a one-dimensional posterior distribution, 

p{p\x) (X p{x\p,) p{p), (2) 

with the likelihood 

N 

pix\p) = A/'(x„|/r,cr^) 

n—1 

and prior 

P{p) = JV{p\m,s‘^) . 

Separating the data into J = N/B batches of size 
B and decomposing the prior into J individual terms 
then gives 


Given any subset of the data, we can approximate the 
potential energy as U ^ JVj and then generate tra¬ 
jectories corresponding to the flow of the approximate 
Hamiltonian, Hj = T + J Vj. In order to avoid parsing 
the entirely of the data, the Metropolis acceptance pro¬ 
cedure can be neglected and the corresponding samples 


V-i = const -I- 


B a^+ Ns^ 
N 


(iE 


n={j-l)B+l • 


Ns^ 


0-2 + Ns"^ 
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Exact 


Figure 1. The bias induced by subsampling data in Hamiltonian Monte Carlo depends on how precisely the gradients 
of the subsampled potential energies integrate to the gradient of the true potential energy, (a) When the subsampled 
gradient is close to the true gradient, the stochastic trajectory will follow the true trajectory and the bias will be small, 
(b) Conversely, if the subsampled gradient is not close to the true potential energy then the stochastic trajectory will 
drift away from the true trajectory and induce a bias. Subsampling between trajectories requires that each subsampled 
gradient approximate the true gradient, while subsampling within a single trajectory requires only that the average of 
the subsampled gradients approximates the true gradient. As the dimension of the target distribution grows, however, an 
accurate approximation in either case becomes increasingly more difficult unless the data become correspondingly more 
redundant relative to the complexity of the target distribution. 


Here I take ct = 2, m = 0, s = 1, and generate N = 500 
data points assuming ^ = 1. 

When the full data are used, numerical trajectories 
generated by the second-order symplectic integrator 
constructed above closely follow the true trajectories 
(Figure 2a). Approximating the potential with a sub¬ 
sample of the data introduces the aforementioned bias, 
which shifts the stochastic trajectory away from the 
exact trajectory despite negligible error from the sym¬ 
plectic integrator itself (Figure 2b). Only when the 
size of each subsample approaches the full data set, 
and the computational benefit of subsampling fades, 
does the stochastic trajectory provide a reasonable ap¬ 
proximation to the exact trajectory (Figure 2c) 

As noted above, geometric considerations suggest that 
this bias should grow with the dimensionality of the 
target distribution. To see this, consider running 
Hamiltonian Monte Carlo, implemented with the same 
second-order symplectic integrator using a step size, 
e, and a random integration time for each trajectory, 
T ^ [7(0, 27r), on the multivariate generalization of (2), 

D 

Y[p{fJ-d\xd), (3) 

d=l 

where the true pd are sampled from fid ~ ^7(0,1). As 
a surrogate for the accuracy of the resulting samples I 
will use the average Metropolis acceptance probability 
using the full data. 


size of the symplectic integrator can be tuned to main¬ 
tain constant accuracy as the dimensionality of the 
target distribution, D, increases. The bias induced 
by subsampling between trajectories, however, is in¬ 
variant to the step size of the integrator and rapidly 
increases with the dimension of the target distribution. 
Here the data were partitioned into J = 25 batches of 
B = 20 data, the subsample used for each trajecto¬ 
ries randomly selected from the first five batches, and 
the step size of the subsampled trajectories reduced by 
N/{J ■ B) = 5 to equalize the computational cost with 
full data trajectories (Figure 3). 

3.2. Subsampling Data within a Single 
Trajectory 

Given that using a single subsample for an entire tra¬ 
jectory introduces an irreducible bias, we might next 
consider subsampling at each step within a single tra¬ 
jectory, hoping that the bias from each subsample 
cancels in expectation. Ignoring any Metropolis cor¬ 
rection, this is exactly the naive stochastic gradient 
Hamiltonian Monte Carlo of (Chen et ah, 2014). 

To understand the accuracy of this strategy consider 
building up such a stochastic trajectory one step at a 
time. Given the first two randomly-selected subsam¬ 
ples, Vi and then Vj, the first two steps of the resulting 


When the full data are used in this model, the step 
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Full Data (J= 1,B = 500) 


Exact Level Set - 

Modified Level Set - - - 



(a) 


Small Subset (J = 50, B = 10) 


Exact Level Set - 

Modified Level Set - - • 
Exact Stochastic Level Set 
Modified Stochastic Level Set 




Full Data 


Subsampled Between 
Subsampled Within 


Figure 3. When the full data are used, high accuracy of 
Hamiltonian Monte Carlo samples, here represented by the 
average Metropolis acceptance probability using the full 
data, can be maintained even as the dimensionally of the 
target distribution grows. The biases induced when the 
data are subsampled, however, cannot be controlled and 
quickly devastate the accuracy of the algorithm. Here the 
step size of the subsampled algorithms has been decreased 
relative to the full data algorithm in order to equalize the 
computational cost - even in this simple example, a proper 
implementation of Hamiltonian Monte Carlo can achieve a 
given accuracy much more efficiently than subsampling. 


integrator are given by 


Large Subset (J = 2, B = 250) 


Exact Level Set - 

Modified Level Set - - ■ 
Exact Stochastic Level Set 
Modified Stochastic Level Set 



(c) 


= expf2ei) — e (AVi + AV 


^2 r 


+■ 


2 L 


H - AVj,H - AVi 


+ 0{e^) 




where we have used the fact that the {Af j } commute 
with each other. Similarly, the first three steps are 
given by 


Figure 2. Even for the one-dimensional target distribution 
(2), subsampling data in between Hamiltonian trajectories 
introduces significant pathologies, (a) When the full data 
are used, numerical Hamiltonian trajectories (dashed line) 
closely track the exact Hamiltonian trajectories (solid line). 
Subsampling of the data introduces a bias in both the exact 
trajectories and corresponding numerical trajectories, (b) 
If the size of each subsample is small then this bias is large, 
offsetting both the exact and numerical stochastic trajecto¬ 
ries. (c) Only when the size of the subsamples approaches 
the size of the full data, and any computational benefits 
from subsampling wane, do the stochastic trajectories pro¬ 
vide a reasonable emulation of the true trajectories. 


o o 6^' 


= exp 3ei7 - e -h AV, + AVi 


)(3ei7-e( 

4 (- 


H. Ay,, 


AVj,H 




H - AVk,2H - AVi - Ay, 
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= exp^Sef? — e (^AVi + AVj + AV 


and, letting ji denote the subsample chosen at the l- 
th step, the composition over an entire trajectory be¬ 
comes 




= exp I (Le) H - (Le) - ^ AVj 


+ (Le) e ^ 
+ {Le)0{e^) 


1^1 


H,AVj, 


H, AVj 


( 1 ^ _ 
= exp ( tH — Ty 




( 


H,AV,, 


+Te 

= exp ( tH + tBi + tB2 


H,AVj, 

0{e^), 


))+o{^) 


where 

1 = 1 

and 

Once again, subsampling the data introduces bias into 
the numerical trajectories. 


B 2 — 


H,AV,, 


H,AV 


JL 


In order to ensure that the bias vanishes identically 
and independent of the redundancy of the data, we 
have to use each subsample the same number of times 
within a single trajectory. In particular, both biases 
vanish if we use each subsample twice in a symmetric 
composition of the form 

Because this composition requires using all of the sub¬ 
samples it does not provide any computational savings 
and it seems rather at odd with the original stochastic 
subsampling motivation. 

Indeed, this symmetric composition is not stochas¬ 
tic at all and actually corresponds to a rather elab¬ 
orate symplectic integrator with an effective step size 
of Je, where the potential energy from each subsam¬ 
ple generates its own flow, equivalent to the integrator 
in Split Hamiltonian Monte Carlo (Shahbaba et ah, 
2014). Removing intermediate steps from this sym¬ 
metric, stochastic trajectory (Figure 4a) reveals the 
level set of the corresponding modified Hamiltonian 
(Figure 4b). Because this symmetric composition in¬ 
tegrates the full Hamiltonian system, the error is once 
again controllable and vanishes as the step size is de¬ 
creased (Figure 4c). 

Limiting the number of subsamples, however, leaves 
the irreducible bias in the trajectories that cannot be 
controlled by the tuning the step size (Figures 3, 5). 
Once more we are left dependent on the redundancy of 
the data for any hope of improved performance with 
subsampling. 


Although the second source of bias, B 2 , is immediately 
rectified by appending the stochastic trajectory with 
an update from the initial subsample such that Jl = 
ji, the first source of bias, Hi, is not so easily remedied. 
Expanding, 



n—1 


ydq L ^ dq j dp' 

we see that Bi vanishes only when the average gradi¬ 
ent of the selected subsamples yields the gradient of 
the full potential. Averaging over subsamples may re¬ 
duce the bias compared to using a single subsample 
over the entire trajectory (I), but the bias still scales 
poorly with the dimensionality of the target distribu¬ 
tion (Figure I). 


4. Conclusion 

The efficacy of Markov Chain Monte Carlo for com¬ 
plex, high-dimensional target distributions depends 
on the ability of the sampler to explore the intricate 
and often meandering neighborhoods on which prob¬ 
ability is distributed. Symplectic integrators admit 
a structure-preserving implementation of Hamiltonian 
Monte Carlo that is amazingly robust to this complex¬ 
ity and capable of efficiently exploring the most com¬ 
plex target distributions. Subsampled data, however, 
does not in general have enough information to en¬ 
able such efficient exploration; this lack of information 
manifests as an irreducible bias that devastates the 
scalable performance of Hamiltonian Monte Carlo. 

Consequently, without having access to the full data 
there is no immediate way of engineering a well- 
behaved implementation of Hamiltonian Monte Carlo 
applicable to most statistical models. As with so many 
other subsampling algorithms, the adequacy of a sub¬ 
sampled Hamiltonian Monte Carlo implementation is 
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e = 0.05 
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Modified Level Set 
Numerical Trajectory 
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Subsampled Trajectory • 
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(a) 


e = 0.0005 


Exact Level Set - 

Subsampled Trajectory • 



(c) 


(b) 


Figure 4. The symmetric composition of flows from each 
subsamples of the data eliminates all bias in the stochastic 
trajectory because it implicitly reconstructs a symplectic 
integrator. Refining (a) all intermediate steps in a stochas¬ 
tic trajectory (b) to only those occurring after a symmetric 
sweep of the subsamples reveals the level set of the modi¬ 
fied Hamiltonian corresponding to the implicit symplectic 
integrator. Because of the vanishing bias, (c) the error in 
the stochastic trajectory can be controlled by taking the 
step size to zero. 


Figure 5. (a) Utilizing only a few subsamples within a tra¬ 
jectory yields numerical trajectories biased away from the 
exact trajectories, (b) Unlike the error introduced by a 
full symplectic integrator, this bias is irreducible and can¬ 
not be controlled by tuning the step size. The performance 
of such an algorithm is limited by the size of the bias which 
itself depends on the redundancy of the data relative to the 
target model. 
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at the mercy of the redundancy of the data relative 
to the complexity of the target model, and not in the 
control of the user. 

Unfortunately many of the problems at the frontiers of 
applied statistics are in the wide data regime, where 
data are sparse relative to model complexity. Here 
subsampling methods have little hope of success; we 
must focus our efforts not on modifying Hamiltonian 
Monte Carlo but rather on improving its implemen¬ 
tation with, for example, better memory management 
and efficiently parallelized gradient calculations. 
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